We consider two scenarios of multiclass online learning of a hypothesis class $H \subseteq Y^X$. In the full information scenario, the learner is exposed to instances together with their labels. In the bandit scenario, the true label is not exposed, but rather an indication whether the learner's prediction is correct or not. We show that the ratio between the error rates in the two scenarios is at most $8 \cdot |Y| \cdot \log(|Y|)$ in the realizable case, and $\tilde{O}(\sqrt{|Y|})$ in the agnostic case. The results are tight up to a logarithmic factor and essentially answer an open question from Daniely et al. (2011).

We apply these results to the class of $\gamma$-margin multiclass linear classifiers in $\mathbb{R}^d$. We show that the bandit error rate of this class is $\tilde{\Theta}\left(\frac{|Y|}{\gamma^2}\right)$ in the realizable case and $\tilde{\Theta}\left(\frac{1}{\gamma}\sqrt{|Y|T}\right)$ in the agnostic case. This resolves an open question from Kakade et al. (2008).

1 Introduction

Online multiclass classification is an important task in Machine Learning. In its basic form, which we refer to as the full information scenario, the learner is required to predict the label of a new example, based on previously observed labeled examples. Recently, the bandit scenario has received much attention (e.g. Auer et al. (2003), Kakade et al. (2008), Dani et al. (2008), Auer et al. (2002)). Here, the learner does not observe labeled examples. Rather, it observes unlabeled examples, predicts their labels and only receives an indication whether its prediction was correct. The relevance of the bandit scenario to practice is evident: a canonical example is internet advertising, where the advertiser chooses a commercial (which is thought of as a label) based on the information it has on the user (which is thought of as an instance). After choosing a commercial, the advertiser only knows whether the user has clicked the commercial or not.

Let $X$ be the instance space and $Y$ the label space. Denote $k = |Y|$. To evaluate learning algorithms, it is common to compare them to the best hypothesis coming from some fixed hypothesis class $H \subseteq Y^X$. We define the error rate of $H$ as the least number, $\mathrm{Err}_H(T)$, for which some algorithm is guaranteed to make at most $\mathrm{Err}_H(T)$ mistakes more than the best hypothesis in $H$, when running on a sequence of length $T$. We emphasize that we consider all algorithms, not only efficient ones.

It is clear that learning is harder in the bandit scenario. The purpose of this work is to quantify how much larger the error rate is in this scenario. Our main results show that, for every hypothesis class $H$, the error rate in the bandit scenario is only $\tilde{O}(k)$ times larger in the realizable case (i.e. in the case that some hypothesis in $H$ makes no mistakes) and $\tilde{O}(\sqrt{k})$ times larger in the general (agnostic) case. We note that our results hold also for multiclass multi-label categorization, where a set of labels is allowed to be correct. As an application, we use our results to quantify the error rate of the class of large margin halfspace classifiers.

1.1 Related Work 

Cardinality based vs. Dimension based bounds. The celebrated result of Cesa-Bianchi et al. (1997) shows that, in the full-info scenario, the error rate is upper bounded by $O\left(\sqrt{\log(|H|)T}\right)$. In the full-info-realizable case, the majority algorithm achieves an error rate of $O(\log(|H|))$.

These two bounds are tight for several hypothesis classes. However, there are several important classes for which much better error rates can be achieved. For example, those bounds are meaningless for infinite hypothesis classes, yet several such classes (e.g. the class of large margin halfspace classifiers) do admit a finite error rate.

The reason for these deficiencies is that the quantity $\log(|H|)$ does not quantify the true complexity of the class, but only upper bounds it. To remedy that, Daniely et al. (2011), following a binary version from Ben-David et al. (2009) and Littlestone (1988), proposed a notion of dimension (a-la VC dimension), called the Littlestone dimension. As shown in Daniely et al. (2011), the error rate of $H$, in the full-info scenario, is $\tilde{\Theta}\left(\sqrt{L(H)T}\right)$ in the agnostic case and $\Theta(L(H))$ in the realizable case^1.

The results of Daniely et al. (2011) show that the term $\log(|H|)$ in the result of Cesa-Bianchi et al. (1997) and in the bound of the majority algorithm can be replaced by $L(H)$ (the algorithms they use are different, however), leading to a tight (up to log factors of $T$ and $k$) characterization of the error rate. Our results can be seen as analogues of these results in the bandit scenario: By the algorithm of Auer et al. (2003) the error rate of $H$ in the bandit scenario is $O\left(\sqrt{kT\log(|H|)}\right)$. By the Majority algorithm, the error rate in the realizable-bandit scenario is $O(k\log(|H|))$. Our results upper bound the error rates by $\tilde{O}\left(\sqrt{kT\,L(H)}\right)$ and $\tilde{O}(k\,L(H))$ respectively.
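The $O(k\log(|H|))$ bound for the majority algorithm can be illustrated with a short sketch. The code below is a minimal illustration under assumptions of ours (a finite class given as tuples of labels, a single correct label per round, and the hypothetical name `bandit_majority`); it is not taken from the paper. Each wrong plurality guess removes at least a $1/k$ fraction of the surviving hypotheses, so the number of mistakes is at most $k\ln|H|$.

```python
from collections import Counter

def bandit_majority(hypotheses, instances, target):
    """Plurality-vote learner with bandit feedback, realizable case.

    hypotheses: list of tuples; h[x] is the label h assigns to instance x.
    target:     a member of `hypotheses` that generates the true labels.
    Returns the number of mistakes made on `instances`.
    """
    version_space = list(hypotheses)
    mistakes = 0
    for x in instances:
        votes = Counter(h[x] for h in version_space)
        y_hat = votes.most_common(1)[0][0]          # plurality label
        if y_hat == target[x]:
            # Feedback "correct": keep the hypotheses that voted for y_hat.
            version_space = [h for h in version_space if h[x] == y_hat]
        else:
            # Feedback "wrong": discard the hypotheses that voted for y_hat,
            # which is at least a 1/k fraction of the version space.
            mistakes += 1
            version_space = [h for h in version_space if h[x] != y_hat]
    return mistakes
```

For example, with all 27 functions from three instances to three labels, every target incurs at most $3\ln 27 \approx 9.9$ mistakes.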

Since the Littlestone dimension characterizes the full-info error rate, our results imply an upper bound on the ratio between the bandit and full-info error rates. To the best of our knowledge, these are the first upper bounds on this ratio that hold for every class. We note also that since $L(H) \le \log(|H|)$, our bounds, up to log factors, imply the bounds of Auer et al. (2003) and the majority algorithm.

Comparison to other settings. In the statistical/PAC settings, one assumes that the sequence of examples is drawn i.i.d. from some distribution on $X \times \mathcal{Y}$. In these settings, it is not hard to show that the bandit error rate is at most $O(k)$ times larger than the full-info rate (see Section 4 and Daniely et al. (2011)). Our results generalize these facts to the adversarial setting.

^1 A detailed study of online analogs to statistical complexity measures can be found in Rakhlin et al. (2010).

2 Our Results 

2.1 Problem setting and Background 

Settings. Fix an instance space $X$ and a label space $Y$. Denote $Z = X \times Y$, $\mathcal{Y} = 2^{Y}$, $\mathcal{Z} = X \times \mathcal{Y}$ and $k = |Y|$. We consider two scenarios of multiclass online learning. In the full information scenario, at each step $t = 1, 2, \ldots$ a full-info learning algorithm is exposed to an instance $x_t \in X$, predicts a label $\hat{y}_t \in Y$ and then observes a list of true labels $Y_t \in \mathcal{Y}$ (note that this is a little more general than the vanilla multiclass setting, in which $|Y_t| = 1$). The prediction $\hat{y}_t$ can be based only on the previously observed labeled examples $(x_1, Y_1), \ldots, (x_{t-1}, Y_{t-1})$ and on $x_t$. The bandit scenario is similar. The sole difference is that, after a bandit learning algorithm predicts a label, the true labels are not exposed, but only an indication whether the algorithm's prediction was correct or not. Therefore, the prediction $\hat{y}_t$ can be based only on the previously observed unlabeled examples $x_1, \ldots, x_{t-1}, x_t$ and on previously obtained indications $1(\hat{y}_1 \in Y_1), \ldots, 1(\hat{y}_{t-1} \in Y_{t-1})$. We assume that the choice of the sequence $(x_t, Y_t)$ is adversarial, but the adversary chooses $Y_t$ before the algorithm predicts $\hat{y}_t$. In particular, the algorithm may choose $\hat{y}_t$ at random after the adversary chose $Y_t$.
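The two scenarios differ only in the feedback returned after each prediction. The following Python sketch spells this out; the names `run_online`, `predict`, `update` and the toy learner `UniformGuesser` are illustrative assumptions of ours and do not appear in the paper.

```python
import random

def run_online(learner, sequence, bandit):
    """Run the online protocol once and return the number of mistakes.

    sequence: list of (x_t, Y_t) pairs, Y_t being the set of correct labels.
    bandit:   if True, the learner only sees the indicator 1[y_hat in Y_t];
              otherwise it sees the whole set Y_t (full information).
    """
    mistakes = 0
    for x, correct in sequence:
        y_hat = learner.predict(x)
        hit = y_hat in correct
        mistakes += not hit
        feedback = hit if bandit else frozenset(correct)
        learner.update(x, y_hat, feedback)
    return mistakes

class UniformGuesser:
    """Toy learner (assumption for illustration): guesses uniformly at random."""

    def __init__(self, labels, seed=0):
        self.labels = list(labels)
        self.rng = random.Random(seed)
        self.history = []

    def predict(self, x):
        return self.rng.choice(self.labels)

    def update(self, x, y_hat, feedback):
        # A real learner would use the feedback; here it is only recorded.
        self.history.append((x, y_hat, feedback))
```

In the bandit run the recorded feedback is a single boolean per round, whereas in the full-info run it is the set $Y_t$ itself.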



Let $H$ be a hypothesis class, which might be either a class of functions from $X$ to $Y$, or a class of functions from $X$ to $\mathbb{R}^{Y}$. We say that a sequence $(x_1, y_1), \ldots, (x_T, y_T) \in Z^T$ is realizable by $H$ if there exists a function $h \in H$ such that either $\forall 1 \le t \le T,\ h(x_t) = y_t$, for the case that $H \subseteq Y^X$, or $\forall 1 \le t \le T,\ h_{y_t}(x_t) \ge 1 + \max_{y \ne y_t} h_y(x_t)$, in the case^2 that $H \subseteq (\mathbb{R}^{Y})^{X}$. We denote by $H(T) \subseteq Z^T$ the sequences of length $T$ that are realizable by $H$. We say that a sequence $(x_1, Y_1), \ldots, (x_T, Y_T) \in \mathcal{Z}^T$ is realizable by $H$ if there exist $y_1 \in Y_1, \ldots, y_T \in Y_T$ such that the sequence $(x_1, y_1), \ldots, (x_T, y_T)$ is realizable by $H$.

The error of $H$ on a sequence $z = ((x_1, Y_1), \ldots, (x_T, Y_T)) \in \mathcal{Z}^T$ is the minimal number of errors that a hypothesis from $H$ makes on the sequence $z$. Namely,

$$\mathrm{Err}(H, z) = \min_{h \in H} \left|\{t \in [T] : h(x_t) \notin Y_t\}\right| .$$



^2 In some contexts it is favourable to use a margin-dependent notion of realization, namely, to define a $\gamma$-realizable sequence by requiring that $\forall 1 \le t \le T,\ h_{y_t}(x_t) \ge \gamma + \max_{y \ne y_t} h_y(x_t)$. Observing that a sequence is $\gamma$-realizable by $H$ iff it is realizable by $\frac{1}{\gamma} \cdot H$, it is easy to interpolate between the two definitions. Our choice of the above definition is merely for the sake of clarity.

Let $A$ be a (either full-info or bandit) learning algorithm. Given $z \in \mathcal{Z}^T$, we denote by $\mathrm{Err}(A, z)$ the expected number of errors $A$ makes, running on the sequence $z$. We define the realizable error rate of $A$ w.r.t. $H$ as the worst case performance of $A$ on a length $T$ realizable sequence, namely,



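$$\mathrm{Err}^{r}_{A}(T) = \sup_{z \in \mathcal{Z}^T,\ \mathrm{Err}(H,z)=0} \mathrm{Err}(A, z) .$$

The agnostic error rate of $A$ is its worst case performance over all length $T$ sequences, namely,

$$\mathrm{Err}^{a}_{A}(T) = \sup_{z \in \mathcal{Z}^T} \left( \mathrm{Err}(A, z) - \mathrm{Err}(H, z) \right) .$$

The realizable and agnostic full-info error rates of the class $H$ are the best achievable error rates, namely,

$$\mathrm{Err}^{r}_{H}(T) = \inf_{A \text{ is a full-info alg.}} \mathrm{Err}^{r}_{A}(T) \quad \text{and} \quad \mathrm{Err}^{a}_{H}(T) = \inf_{A \text{ is a full-info alg.}} \mathrm{Err}^{a}_{A}(T) .$$

Similarly, the realizable and agnostic bandit error rates of the class $H$ are

$$\text{B-Err}^{r}_{H}(T) = \inf_{A \text{ is a bandit alg.}} \mathrm{Err}^{r}_{A}(T) \quad \text{and} \quad \text{B-Err}^{a}_{H}(T) = \inf_{A \text{ is a bandit alg.}} \mathrm{Err}^{a}_{A}(T) .$$

Our main focus is to understand how much larger the error rate is in the bandit scenario, compared to the full-info scenario. Thus we define the agnostic and realizable price of bandit information of $H$ by

$$\mathrm{POB}^{a}_{H}(T) = \frac{\text{B-Err}^{a}_{H}(T)}{\mathrm{Err}^{a}_{H}(T)} , \qquad \mathrm{POB}^{r}_{H}(T) = \frac{\text{B-Err}^{r}_{H}(T)}{\mathrm{Err}^{r}_{H}(T)} .$$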

The Littlestone dimensions. Daniely et al. (2011) (following Ben-David et al. (2009) and Littlestone (1988)) defined two combinatorial notions of dimension that characterize the error rates of a class $H$. Let $\mathcal{T}$ be a rooted tree whose internal nodes are labeled by $X$ and whose edges are labeled by $Y$. We say that $\mathcal{T}$ is L-shattered by $H$ if, for every root-to-leaf path $x_1, \ldots, x_T$, the sequence $(x_1, y_1), \ldots, (x_{T-1}, y_{T-1})$, where $y_t$ is the label associated with the edge $x_t \to x_{t+1}$, is realizable by $H$. The Littlestone dimension of $H$, denoted $L(H)$, is the maximal depth of a complete binary tree^3 that is L-shattered by $H$. We say that $\mathcal{T}$ is BL-shattered by $H$ if, for every root-to-leaf path $x_1, \ldots, x_T$, if $y_t$ is the label associated with $x_t \to x_{t+1}$, then there exists a realizable sequence $(x_1, y'_1), \ldots, (x_{T-1}, y'_{T-1})$ such that $\forall t,\ y'_t \ne y_t$. The Bandit Littlestone dimension of $H$, denoted $\mathrm{BL}(H)$, is the maximal depth of a complete $k$-ary tree that is BL-shattered by $H$.
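For a small finite class the Littlestone dimension can be computed by brute force. The sketch below is an illustration under assumptions of ours: hypotheses are tuples indexed by instance, and the two edges leaving a node of the shattered tree carry distinct labels (as in Daniely et al. (2011)). Under these assumptions, $L(V) = \max_x \max_{y \ne y'} \left(1 + \min\{L(V_x^{y}), L(V_x^{y'})\}\right)$, where $V_x^y$ is the set of hypotheses in $V$ that assign $y$ to $x$.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def ldim(V):
    """Littlestone dimension of a finite class V.

    V is a frozenset of tuples; f[x] is the label that f assigns to instance x.
    The empty class has dimension -1 by convention.
    """
    if not V:
        return -1
    n = len(next(iter(V)))
    best = 0
    for x in range(n):
        labels = {f[x] for f in V}
        if len(labels) < 2:
            continue  # x cannot be the root of a shattered tree of depth >= 1
        sub = sorted(
            (ldim(frozenset(f for f in V if f[x] == y)) for y in labels),
            reverse=True,
        )
        # The best pair of distinct edge labels is the two largest sub-dimensions.
        best = max(best, 1 + sub[1])
    return best
```

For $H = Y^X$ with $|X| = 2$ and $|Y| = 3$ this returns 2, in line with $L(Y^X) = |X|$.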

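Theorem 2.1 (Daniely et al. (2011))

• For every class $H$ and for every $T \ge L(H)$, $\frac{1}{2} L(H) \le \mathrm{Err}^{r}_{H}(T) \le L(H)$ and $\Omega\left(\sqrt{L(H)T}\right) \le \mathrm{Err}^{a}_{H}(T) \le O\left(\sqrt{L(H)T\log(kT)}\right)$.

• For every class $H$, $\text{B-Err}^{r}_{H}(T) \le \mathrm{BL}(H)$. Moreover, for every deterministic bandit algorithm $A$, $\mathrm{Err}^{r}_{A}(T) \ge \min\{T, \mathrm{BL}(H)\}$.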



^3 By a complete binary tree, we mean a tree in which every internal node has two children and all leaves are at the same depth.

The class of large-margin multiclass linear separators. Denote by $B^d$ the unit ball in $\mathbb{R}^d$. We identify every matrix $W \in M_{k \times d}(\mathbb{R})$ with the linear function it defines on $B^d$ (i.e. $x \mapsto Wx$). Denote by $\|W\|_F$ the Frobenius norm of $W$, namely, $\|W\|_F = \sqrt{\sum_{i=1}^{k} \sum_{j=1}^{d} W_{ij}^2}$. For $D > 0$ let $W^{d,k}(D) = \{W \in M_{k \times d}(\mathbb{R}) : \|W\|_F \le D\}$.

A multiclass variant of the Perceptron algorithm (see Crammer and Singer (2003)) makes at most $2 \cdot D^2$ mistakes whenever it runs on a sequence that is realizable by $W^{d,k}(D)$. Therefore, $L(W^{d,k}(D)) \le 2 \cdot D^2$. Also, it is not hard to see that $\min\{d, \lfloor D^2 \rfloor\} \le L(W^{d,k}(D))$. Thus, we have

$$\min\{d, \lfloor D^2 \rfloor\} \le L(W^{d,k}(D)) \le 2 \cdot D^2 \qquad (1)$$
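The upper bound in Equation (1) comes from a mistake-driven linear learner. The sketch below shows one common multiclass Perceptron update (add the instance to the row of the correct label and subtract it from the row of the predicted label); we assume this variant for illustration, and the function name `multiclass_perceptron` is ours. For this update, each mistake increases $\langle W^*, W \rangle$ by at least 1 and $\|W\|_F^2$ by at most 2, which yields at most $2D^2$ mistakes when some $W^*$ with $\|W^*\|_F \le D$ realizes the sequence.

```python
def multiclass_perceptron(examples, k, d):
    """Run one pass over `examples` and return (W, mistakes).

    examples: list of (x, y); x is a length-d list with Euclidean norm <= 1,
              y is a label in range(k).
    """
    W = [[0.0] * d for _ in range(k)]
    mistakes = 0
    for x, y in examples:
        scores = [sum(w * xi for w, xi in zip(row, x)) for row in W]
        y_hat = max(range(k), key=scores.__getitem__)  # ties go to the smallest label
        if y_hat != y:
            mistakes += 1
            for i in range(d):
                W[y][i] += x[i]      # promote the correct label
                W[y_hat][i] -= x[i]  # demote the wrongly predicted label
    return W, mistakes
```

On the sequence $(e_1, 1), (e_2, 2), (e_3, 3)$ repeated several times, which is realized by the identity matrix with $D^2 = 3$, the number of mistakes stays below $2D^2 = 6$.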

2.2 Results 

Our first result bounds the bandit-realizable error rate in terms of the Littlestone dimension. 

Theorem 2.2 For every hypothesis class $H$, $\text{B-Err}^{r}_{H}(T) \le 4|Y| \log(|Y|)\, L(H)$.

Together with Theorem 2.1 we conclude that the bandit-realizable error rate is at most $\tilde{O}(k)$ times larger than the full-info-realizable error rate. Namely, for every hypothesis class $H$,

$$\mathrm{POB}^{r}_{H}(T) \le 8|Y| \cdot \log(|Y|) . \qquad (2)$$

It is not hard to see (e.g. by Claim 2) that for $H = Y^X$ and $T \ge (|Y| - 1) \cdot |X|$, we have that $\text{B-Err}^{r}_{H}(T) \ge \frac{(|Y|-1) \cdot |X|}{2} = \frac{|Y|-1}{2} \cdot L(H)$. Thus, Theorem 2.2 is tight up to a factor of $\log(|Y|)$. By Theorem 2.1,

$$\mathrm{POB}^{r}_{H}(T) = \frac{\text{B-Err}^{r}_{H}(T)}{\mathrm{Err}^{r}_{H}(T)} \ge \frac{\frac{|Y|-1}{2} \cdot L(H)}{L(H)} = \frac{|Y|-1}{2} .$$

Thus, Equation (2) is tight up to a factor of $\log(|Y|)$ as well. In Daniely et al. (2011) it has been asked how large the ratio $\frac{\mathrm{BL}(H)}{L(H)}$ can be. It can be easily seen that $\frac{\mathrm{BL}(H)}{L(H)}$ can be as large as $|Y| - 1$ (this is true, for example, when $X$ is finite and $H = Y^X$). Theorem 2.2, together with Theorem 2.1, shows that $\frac{\mathrm{BL}(H)}{L(H)} \le 4|Y| \log(|Y|)$, essentially answering the question of Daniely et al. (2011).

For the agnostic case we show the following result: 

Theorem 2.3 For every class $H$, $\text{B-Err}^{a}_{H}(T) \le e \cdot \sqrt{T|Y|\,L(H) \log(T \cdot |Y|)}$.

Together with Theorem 2.1, it follows that for every class $H$,

$$\mathrm{POB}^{a}_{H}(T) = O\left(\sqrt{|Y| \cdot \log(|Y| \cdot T)}\right) . \qquad (3)$$

Relying on the construction from section 5 of Auer et al. (2003), it is not hard to show that for $H = Y^X$ and $T \ge |Y| \cdot |X| = |Y| \cdot L(H)$ it holds that $\text{B-Err}^{a}_{H}(T) \ge \Omega\left(\sqrt{L(H) \cdot T \cdot |Y|}\right)$. Thus, together with Theorem 2.1, Theorem 2.3 and Equation (3) are tight up to a logarithmic factor of $\log(|Y| \cdot T)$.

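Next, we apply Theorems 2.2 and 2.3 to analyse the bandit error rate of large margin multiclass linear separators.

Theorem 2.4 For every $D > 0$ and $d, k$,

$$\text{B-Err}^{r}_{W^{d,k}(D)}(T) \le 8 \cdot k \cdot \log(k) \cdot D^2 , \qquad \text{B-Err}^{a}_{W^{d,k}(D)}(T) \le 4 \cdot D \cdot \sqrt{Tk \log(T \cdot k)} .$$

Moreover, for $L = \min\{d, \lfloor D^2 \rfloor\}$ and $T \ge k \cdot L$,

$$\text{B-Err}^{r}_{W^{d,k}(D)}(T) \ge \frac{(k-1) \cdot L}{2} , \qquad \text{B-Err}^{a}_{W^{d,k}(D)}(T) \ge \Omega\left(\sqrt{L \cdot T \cdot k}\right) .$$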

Kakade et al. (2008) have shown an (inefficient) randomized algorithm that makes, w.p. $1 - \delta$, at most $O\left(k^2 D^2 \ln\left(\frac{T+k}{\delta}\right) \cdot \left(\ln D + \ln\ln\left(\frac{T+k}{\delta}\right)\right)\right)$ mistakes, whenever it runs on a sequence $(x_1, Y_1), \ldots, (x_T, Y_T)$ that is realizable by $W^{d,k}(D)$. It has been asked there what the optimal error rate is, and whether there exists an asymptotically finite bound on the error rate that does not depend on the dimension $d$. Theorem 2.4 answers the second question in the affirmative and essentially answers the first question. Also, Kakade et al. (2008) have conjectured that for fixed $D$ and $k$, the bandit agnostic error rate of $W^{d,k}(D)$ should be $O(\sqrt{T})$. Theorem 2.4 validates this conjecture, up to a factor of $\sqrt{\log(T)}$.

The bound in Theorem 2.4 is rather tight when the dimension, $d$, is larger than the complexity $D^2$. To complete the picture, we note that in Kakade et al. (2008) it has been shown that $\text{B-Err}^{r}_{W^{d,k}(D)}(T) \le O\left(k^2 d \log(D)\right)$. Here we show a corresponding lower bound.

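Theorem 2.5 For every $D^2 \ge k^5 d$ and $T \ge \lfloor d/2 \rfloor \cdot \frac{k(k-1)}{2}$ (the length of the adversarial sequence constructed in Claim 3), $\text{B-Err}^{r}_{W^{d,k}(D)}(T) \ge \lfloor d/2 \rfloor \cdot (k-1) \cdot \frac{k}{4}$.

2.3 Proof techniques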

The proof of Theorem 2.2 constitutes most of the technical novelty of the paper. The algorithm we use belongs to the family of "majority vote" algorithms such as the Standard Optimal Algorithms of Littlestone (1988), Ben-David et al. (2009) and Daniely et al. (2011). These algorithms start with a hypothesis class $H_1 = H$. At each step $t$, they predict the label predicted by "most" hypotheses in $H_t$, where "most" is quantified in a certain way. After an indication is given for that prediction (i.e., after the true label is exposed in the full-info scenario or after an indication whether the algorithm's guess was correct or not in the bandit scenario), the algorithm constructs $H_{t+1}$ by throwing away all functions that are in contradiction with that indication.

A notable distinction is that instead of a single hypothesis class, our algorithm keeps a collection of hypothesis classes. At each step, each class in that collection is either split, thrown away, or left untouched. The prediction at each step aims to minimize a measure of capacity for collections of hypothesis classes, which we define. We show that this measure shrinks to 1 after $4|Y| \cdot \log(|Y|) \cdot L(H)$ mistakes. From that point, the algorithm makes no mistakes.

Theorem 2.3 is based on an argument from Ben-David et al. (2009) (see also Daniely et al. (2011)). We represent each class by a relatively small number of experts and apply the result of Auer et al. (2003) to this set of experts. Theorem 2.4 is deduced from Theorems 2.2, 2.3 and Equation (1).

To prove Theorem 2.5, we first consider the class $H$ of all functions $f : [\lfloor d/2 \rfloor] \times [k] \to [k]$ such that $f|_{\{j\} \times [k]}$ is a bijection for every $j \in [\lfloor d/2 \rfloor]$. We show that $\text{B-Err}^{r}_{H}(T) \ge \lfloor d/2 \rfloor \cdot (k-1) \cdot \frac{k}{4}$. Then, we adapt a construction from Daniely et al. (2011) to show that $H$ can be realized by $W^{d,k}(D)$.

3 Proofs



Throughout, we denote by $k$ the number of labels. We prove Theorems 2.2 and 2.3 for the case that $H \subseteq Y^X$. The case of real-valued $H$ can be handled along the same lines. We say that a hypothesis class $H_1$ is realized by $H_2$ if $\forall T,\ H_1(T) \subseteq H_2(T)$. It is clear that in this case the error rates and the Littlestone dimensions of $H_1$ are no larger than those of $H_2$.

3.1 Theorem 2.2 

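Let $\mathcal{H}$ be a collection of non-empty subsets of $H$. We define its capacity by $C(\mathcal{H}) = \sum_{V \in \mathcal{H}} k^{2L(V)}$. We note that for $\mathcal{H} = \{H\}$, it holds that $C(\mathcal{H}) = k^{2L(H)}$. Also, for non-empty $\mathcal{H}$, $C(\mathcal{H}) \ge 1$. Our algorithm starts with $\mathcal{H}_1 = \{H\}$. At each step it modifies $\mathcal{H}_t$ such that (1) $C(\mathcal{H}_t)$ shrinks with every false prediction and (2) all hypotheses that are consistent with the previously observed instances are in one of the subclasses of $\mathcal{H}_t$.

Given $x \in X$, $y \in Y$ and $V \subseteq H$ denote $V_x^y = \{f \in V : f(x) = y\}$. For a collection, $\mathcal{H}$, of subsets of $H$ we define

$$A(\mathcal{H}, x, y_0) = \{V \in \mathcal{H} : \forall y \ne y_0,\ L(V_x^y) < L(V)\}$$

$$\lambda(\mathcal{H}, x, y_0) = \{V_x^y : V \in A(\mathcal{H}, x, y_0),\ y \ne y_0,\ V_x^y \ne \emptyset\} \cup \left(\mathcal{H} \setminus A(\mathcal{H}, x, y_0)\right)$$

$$P_{\mathcal{H},x}(y) = C(\mathcal{H}) - C(\lambda(\mathcal{H}, x, y))$$

Algorithm 1

1: Set $\mathcal{H}_1 = \{H\}$.
2: for $t = 1, 2, \ldots$ do
3: receive $x_t$
4: Predict $\hat{y} \in \arg\max_{y \in Y} P_{\mathcal{H}_t, x_t}(y)$.
5: If the prediction is wrong, update $\mathcal{H}_{t+1} = \lambda(\mathcal{H}_t, x_t, \hat{y})$. Otherwise, $\mathcal{H}_{t+1} = \mathcal{H}_t$.
6: end for

Claim 1 Algorithm 1 makes less than $4 L(H) k \log(k)$ mistakes.

Proof Fix $V \in \mathcal{H}_t$, write $x = x_t$, and let $y \in \arg\max_{y' \in Y} L(V_x^{y'})$. Since there is at most one $y' \in Y$ such that $L(V_x^{y'}) = L(V)$ (see Littlestone (1988)), it follows that for every $y' \ne y$, $L(V_x^{y'}) < L(V)$. In particular, $V \in A(\mathcal{H}_t, x, y)$ and

$$k^{2L(V)} - \sum_{y' \ne y,\ V_x^{y'} \ne \emptyset} k^{2L(V_x^{y'})} \ge k^{2L(V)} - (k-1) \cdot k^{2L(V)-2} \ge \left(1 - \frac{1}{k}\right) k^{2L(V)} .$$

Thus,

$$\sum_{y \in Y} P_{\mathcal{H}_t, x_t}(y) \ge \sum_{V \in \mathcal{H}_t}\ \sum_{y \in Y : V \in A(\mathcal{H}_t, x_t, y)} \left( k^{2L(V)} - \sum_{y' \ne y,\ V_{x_t}^{y'} \ne \emptyset} k^{2L(V_{x_t}^{y'})} \right) \ge \sum_{V \in \mathcal{H}_t} \left(1 - \frac{1}{k}\right) k^{2L(V)} = \left(1 - \frac{1}{k}\right) C(\mathcal{H}_t) .$$

It follows that for some $y \in Y$, $P_{\mathcal{H}_t, x_t}(y) \ge \frac{1}{k}\left(1 - \frac{1}{k}\right) C(\mathcal{H}_t) \ge \frac{1}{2k} C(\mathcal{H}_t)$. Thus, if the algorithm errs at time $t$ then

$$C(\mathcal{H}_{t+1}) \le \left(1 - \frac{1}{2k}\right) C(\mathcal{H}_t) .$$

It follows that after $4 L(H) k \log(k)$ mistakes it will hold that

$$C(\mathcal{H}_t) \le \left(1 - \frac{1}{2k}\right)^{4L(H)k\log(k)} C(\mathcal{H}_1) < e^{-2L(H)\log(k)}\, k^{2L(H)} = 1 .$$

However, it is not hard to see that each hypothesis which is consistent with the history up to time $t - 1$ is in one of the classes of $\mathcal{H}_t$. As we assume that the sequence is realizable, there is at least one consistent hypothesis. Thus, for every $t$, $C(\mathcal{H}_t) \ge 1$. It follows that the algorithm makes less than $4 L(H) k \log(k)$ mistakes.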



□ 
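Algorithm 1 can be run on a small finite class. The following sketch is an illustration under assumptions of ours: the class is a list of label tuples, each round has a single correct label given by a target hypothesis, the collection is stored as a Python set, and the Littlestone dimension is computed by brute force with distinct labels on sibling edges. The helper names (`ldim`, `restrict`, `capacity`, `split`, `algorithm1`) are hypothetical.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def ldim(V):
    """Brute-force Littlestone dimension of a frozenset of label tuples."""
    if not V:
        return -1
    best = 0
    for x in range(len(next(iter(V)))):
        labels = {f[x] for f in V}
        if len(labels) >= 2:
            sub = sorted((ldim(restrict(V, x, y)) for y in labels), reverse=True)
            best = max(best, 1 + sub[1])
    return best

def restrict(V, x, y):
    """V_x^y: the hypotheses in V that assign label y to instance x."""
    return frozenset(f for f in V if f[x] == y)

def capacity(coll, k):
    """C(coll): sum of k^(2 L(V)) over the classes V in the collection."""
    return sum(k ** (2 * ldim(V)) for V in coll)

def split(coll, x, y0, labels):
    """lambda(coll, x, y0): split the classes in A(coll, x, y0), keep the rest."""
    out = set()
    for V in coll:
        parts = [restrict(V, x, y) for y in labels if y != y0]
        if all(ldim(P) < ldim(V) for P in parts):  # V belongs to A(coll, x, y0)
            out.update(P for P in parts if P)
        else:
            out.add(V)
    return out

def algorithm1(H, labels, instances, target):
    """Return the number of mistakes of Algorithm 1 when labels follow `target`."""
    k = len(labels)
    coll = {frozenset(H)}
    mistakes = 0
    for x in instances:
        def gain(y):
            return capacity(coll, k) - capacity(split(coll, x, y, labels), k)
        y_hat = max(labels, key=gain)
        if y_hat != target[x]:
            mistakes += 1
            coll = split(coll, x, y_hat, labels)
    return mistakes
```

With $H = Y^X$, $|X| = 2$ and $|Y| = 3$, the bound of Claim 1 is $4 \cdot 2 \cdot 3 \cdot \ln 3 \approx 26.4$, and the sketch stays below it for every target.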



3.2 Theorem 2.3 

We use a result from Auer et al. (2003), which we briefly describe next. Suppose that at each step, $t$, before the algorithm chooses its prediction, it observes advices $(f_1^t, \ldots, f_N^t) \in Y^N$, which can be used to determine its prediction. We refer to $f_i^t$ as the prediction made by the expert $i$ at time $t$ and denote by $L_{i,T} = |\{t \in [T] : f_i^t \notin Y_t\}|$ the loss of the expert $i$ at time $T$. For every sequence $z \in \mathcal{Z}^T$, the algorithm from Auer et al. (2003), section 7, makes at most $\min_{i \in [N]} L_{i,T} + e\sqrt{kT \log(N)}$ mistakes in expectation.

Suppose that, for every $f \in H$, we construct an expert, $E_f$, whose advice at time $t$ is $f(x_t)$. Denote by $L_{f,T}$ the loss of the expert $E_f$ at time $T$. Running the algorithm of Auer et al. (2003) with this set of experts yields an algorithm whose agnostic error rate is at most $e\sqrt{kT\log(|H|)}$. We proceed by imitating this set of experts with a more compact set of experts, which will allow us to bound the loss in terms of $L(H)$ instead of $\log(|H|)$.

Let $\mathcal{A}_T = \{A \subseteq [T] \mid |A| \le L(H)\}$. For every $A \in \mathcal{A}_T$ and $\phi : A \to Y$, we define an expert $E_{A,\phi}$. The expert $E_{A,\phi}$ imitates the SOA algorithm (Algorithm 2 in the appendix) when it errs exactly on the examples $\{x_t \mid t \in A\}$ and the true labels of these examples are determined by $\phi$. The expert $E_{A,\phi}$ proceeds as follows:

Set $V_1 = H$.
For $t = 1, 2, \ldots, T$
    Receive $x_t$.
    Set $l_t = \arg\max_{y \in Y} L(\{f \in V_t : f(x_t) = y\})$.
    If $t \in A$, predict $\phi(t)$ and update $V_{t+1} = \{f \in V_t : f(x_t) = \phi(t)\}$.
    If $t \notin A$, predict $l_t$ and update $V_{t+1} = \{f \in V_t : f(x_t) = l_t\}$.

The number of experts we constructed is $\sum_{j=0}^{L(H)} \binom{T}{j} k^j = O\left((Tk)^{L(H)}\right)$. Denote the number of mistakes made by the expert $E_{A,\phi}$ after $T$ rounds by $L_{A,\phi,T}$. If we apply the algorithm from Auer et al. (2003) with the set of experts we've constructed, we obtain an algorithm that makes at most

$$\min_{A,\phi} L_{A,\phi,T} + e\sqrt{kT\, L(H) \log(Tk)}$$

mistakes in expectation, whenever it runs on a sequence $z \in \mathcal{Z}^T$. To finish, we show that $\min_{A,\phi} L_{A,\phi,T} \le \mathrm{Err}(H, z)$.

Let $f \in H$ be a function for which $\mathrm{Err}(H, z) = |\{t \in [T] : f(x_t) \notin Y_t\}|$. Denote by $A \subseteq [T]$ the set of rounds in which the SOA algorithm errs when running on the sequence $(x_1, f(x_1)), \ldots, (x_T, f(x_T))$ and define $\phi : A \to Y$ by $\phi(t) = f(x_t)$. Since the SOA algorithm makes at most $L(H)$ mistakes, $|A| \le L(H)$. It is not hard to see that the predictions of the expert $E_{A,\phi}$ coincide with the predictions of the expert $E_f$. Thus,

$$L_{A,\phi,T} = |\{t \in [T] : f(x_t) \notin Y_t\}| = \mathrm{Err}(H, z) .$$
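The key step of this proof, that $E_{A,\phi}$ reproduces the predictions of $E_f$ when $A$ is the set of rounds on which SOA errs, can be checked on a small class. The sketch below is an illustration under assumptions of ours: finite classes of label tuples, 0-indexed rounds, a brute-force Littlestone dimension, and hypothetical helper names (`soa_label`, `expert`, `soa_mistake_rounds`).

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def ldim(V):
    """Brute-force Littlestone dimension of a frozenset of label tuples."""
    if not V:
        return -1
    best = 0
    for x in range(len(next(iter(V)))):
        labels = {f[x] for f in V}
        if len(labels) >= 2:
            sub = sorted((ldim(frozenset(f for f in V if f[x] == y))
                          for y in labels), reverse=True)
            best = max(best, 1 + sub[1])
    return best

def soa_label(V, x, labels):
    """The SOA prediction: a label whose restricted class has maximal dimension."""
    return max(labels, key=lambda y: ldim(frozenset(f for f in V if f[x] == y)))

def expert(H, labels, xs, A, phi):
    """Predictions of the expert E_{A,phi} on the instance sequence xs."""
    V, preds = frozenset(H), []
    for t, x in enumerate(xs):
        y = phi[t] if t in A else soa_label(V, x, labels)
        preds.append(y)
        V = frozenset(f for f in V if f[x] == y)
    return preds

def soa_mistake_rounds(H, labels, xs, f):
    """Rounds (0-indexed) on which SOA errs when the true labels are f(x_t)."""
    V, rounds = frozenset(H), set()
    for t, x in enumerate(xs):
        if soa_label(V, x, labels) != f[x]:
            rounds.add(t)
        V = frozenset(g for g in V if g[x] == f[x])
    return rounds
```

For every $f$ in the class, taking `A = soa_mistake_rounds(...)` and `phi = {t: f[xs[t]] for t in A}` makes `expert(...)` return exactly the labels of $f$, with $|A| \le L(H)$.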

3.3 Theorem 2.4 

The upper bounds follow from Equation (1), together with Theorems 2.2 and 2.3. The lower bounds follow from the corresponding bounds for the class $[k]^{[L]}$, and the fact that this class can be realized by $W^{d,k}(D)$: Associate the set $[L]$ with $e_1, \ldots, e_L \in \mathbb{R}^d$, the first $L$ vectors in the standard basis of $\mathbb{R}^d$. A function $f : \{e_1, \ldots, e_L\} \to [k]$ can be realized by the matrix $W \in W^{d,k}(D)$ whose $i$'th row is $\sum_{j \in [L] : f(j) = i} e_j$.
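The realization above is easy to verify numerically: the matrix has one unit entry per column $j \le L$, so its Frobenius norm is $\sqrt{L} \le D$, and each $e_j$ is classified with margin exactly 1. A minimal sketch follows; the function names are ours and labels are 0-indexed for convenience.

```python
import math

def realize(f, k, d):
    """Return the k x d matrix whose i'th row is the sum of e_j over j with f[j] == i.

    f is a list of length L <= d with entries in range(k).
    """
    W = [[0.0] * d for _ in range(k)]
    for j, label in enumerate(f):
        W[label][j] = 1.0
    return W

def margin(W, j, label):
    """(W e_j)_label minus the largest other coordinate of W e_j."""
    column = [row[j] for row in W]
    return column[label] - max(v for i, v in enumerate(column) if i != label)

def frobenius(W):
    """Frobenius norm of W."""
    return math.sqrt(sum(v * v for row in W for v in row))
```

For instance, `realize([2, 0, 1, 1], k=3, d=6)` has Frobenius norm $2 = \sqrt{L}$ and margin 1 on each of the four basis vectors.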

3.4 Theorem 2.5 

We shall use the following claim. Consider the following game: A r.v., $U$, is sampled uniformly from $Y$. Then the player, who does not observe $U$, tries to guess $U$. After each prediction, $y_t$, he only receives an indication whether $U = y_t$.

Claim 2 Let $R = |\{1 \le t \le |Y| - 1 : y_t \ne U\}|$. Then $E[R] \ge \frac{|Y|-1}{2}$.

Proof We prove the claim by induction on $|Y|$. For $|Y| = 2$ it follows from the fact that $U$ is independent from $y_1$. For $|Y| > 2$, we note that, since $U$ is independent from $y_1$, the probability that $y_1 \ne U$ is at least $\frac{|Y|-1}{|Y|}$. Also, conditioned on the event that $y_1 \ne U$, $U$ and $y_2, y_3, \ldots$ satisfy the requirements of the claim with the label set $Y \setminus \{y_1\}$. Thus, writing $R'$ for the number of wrong guesses among rounds $2, \ldots, |Y| - 1$, by the induction hypothesis,

$$E[R] \ge \left(1 + E[R' \mid y_1 \ne U]\right) \cdot \Pr(y_1 \ne U) \ge \left(1 + \frac{|Y|-2}{2}\right) \cdot \frac{|Y|-1}{|Y|} = \frac{|Y|-1}{2} .$$

□

Claim 3 Let $H$ be the class of all functions $f : [A] \times [k] \to [k]$ such that $f|_{\{j\} \times [k]}$ is a bijection for every $j \in [A]$. Then

$$\text{B-Err}^{r}_{H}(T) \ge A \cdot (k-1) \cdot \frac{k}{4} .$$

Proof Consider the following algorithm, applied by the adversary:

1. For $j = 1, 2, \ldots, A$
    1.1. For $m = 1, 2, \ldots, k-1$
        1.1.1. Choose $y_{j,m} \in [k] \setminus \{y_{j,1}, \ldots, y_{j,m-1}\}$ uniformly at random.
        1.1.2. For $n = 1, \ldots, k-m$
            1.1.2.1. Expose the learner to the instance $(j, m)$.
    1.2. Let $y_{j,k}$ be the element in the singleton $[k] \setminus \{y_{j,1}, \ldots, y_{j,k-1}\}$.

By Claim 2, applied with the $k - m + 1$ labels that are still available, for every $(j, m)$ the adversary causes the learner to make at least $\frac{k-m}{2}$ mistakes in expectation at the predictions for the instance $(j, m)$. Thus, the expected value of the total number of mistakes is at least $A \cdot \sum_{m=1}^{k-1} \frac{k-m}{2} = A \cdot (k-1) \cdot \frac{k}{4}$. Also, it is clear that the function $f : [A] \times [k] \to [k]$ defined by $f(j, m) = y_{j,m}$ is in $H$. Thus, the sequence produced by the adversary is realizable by $H$.

□ 
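The quantities in Claims 2 and 3 can be checked exactly for small $k$. The sketch below is an illustration of ours: it computes $E[R]$ for one particular player, the one that tries the labels in a fixed order and repeats a label once it has been confirmed, and it evaluates the sum used in Claim 3. The function names are hypothetical. For this player the bound of Claim 2 holds with equality.

```python
from fractions import Fraction

def expected_wrong_guesses(k):
    """Exact E[R] over k - 1 rounds for the player that guesses 0, 1, ..., k-2
    in order and repeats a confirmed label; U is uniform on range(k)."""
    total = Fraction(0)
    for u in range(k):
        # Guesses 0, ..., u-1 are wrong; from guess u on the player is right.
        wrong = sum(1 for guess in range(k - 1) if guess < u)
        total += Fraction(wrong, k)
    return total

def claim3_lower_bound(A, k):
    """A * sum_{m=1}^{k-1} (k - m) / 2, the total expected number of mistakes
    forced by the adversary of Claim 3."""
    return A * sum(Fraction(k - m, 2) for m in range(1, k))
```

For example, `expected_wrong_guesses(5)` equals $2 = \frac{5-1}{2}$, and `claim3_lower_bound(3, 4)` equals $9 = 3 \cdot (4-1) \cdot \frac{4}{4}$.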

Theorem 2.5 follows from the following claim. 

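Claim 4 If $D^2 \ge k^5 d$ then the class $H$ from Claim 3 with $A = \lfloor d/2 \rfloor$ is realized by $W^{d,k}(D)$.

Proof Let $d' = \lfloor d/2 \rfloor$. Instead of working in $\mathbb{R}^d$, we work in $\mathbb{C}^{d'}$. We identify each $(j, m) \in [d'] \times [k]$ with $x_{j,m} := e^{2\pi i m / k} \cdot e_j$. Here, $e_j$ is the $j$'th vector in the standard basis of $\mathbb{C}^{d'}$. Let $f \in H$. We must show that $f$ is induced by some $W \in W^{d,k}(D)$ in the sense that

$$\forall (j, m) \in [d'] \times [k], \quad (W x_{j,m})_{f(j,m)} \ge 1 + \max_{m' \ne f(j,m)} (W x_{j,m})_{m'} .$$

Indeed, we let $W \in M_{k \times d'}(\mathbb{C})$ be the matrix defined by

$$\forall (j, m) \in [d'] \times [k], \quad W_{f(j,m),j} = k^2 \cdot e^{2\pi i m / k} .$$

Viewed as a real matrix, $W$ satisfies $\|W\|_F^2 = d' \cdot k \cdot k^4 \le D^2$. We note that for every $j \in [d']$ and $m, m' \in [k]$,

$$(W x_{j,m})_{f(j,m')} = k^2 \cdot \left\langle e^{2\pi i m' / k}, e^{2\pi i m / k} \right\rangle = k^2 \cdot \cos\left(2\pi \frac{m - m'}{k}\right) .$$

Thus,

$$(W x_{j,m})_{f(j,m)} \ge k^2 \cdot \left(1 - \cos(2\pi / k)\right) + \max_{m' \ne m} (W x_{j,m})_{f(j,m')} \ge 1 + \max_{m' \ne m} (W x_{j,m})_{f(j,m')} ,$$

where the last inequality follows from the fact that, by Taylor's Theorem, for $x \in [0, 2\pi/3]$,

$$1 - \cos(x) \ge \frac{x^2}{2} - \frac{x^4}{24} \ge \frac{x^2}{4} .$$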



□ 
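The construction in this proof can be checked numerically for small $k$. The sketch below assumes that the entries of $W$ have modulus $k^2$ (our reading of the scaling in the proof) and represents the complex matrix as a dictionary; the helper names `build`, `score` and `min_margin` are ours. Under this assumption the smallest margin is $k^2(1 - \cos(2\pi/k)) \ge 1$ and the squared Frobenius norm is $d' \cdot k^5$.

```python
import cmath
import math

def build(f, k):
    """f[col] is a permutation of range(k); f[col][m] is the label of instance (col, m).

    Returns W as a dict {(label, col): entry} with entries k^2 * exp(2*pi*i*m/k).
    """
    return {(f[col][m], col): k * k * cmath.exp(2j * math.pi * m / k)
            for col in range(len(f)) for m in range(k)}

def score(W, label, col, m, k):
    """Real inner product of W[label, col] with x_{col,m} = exp(2*pi*i*m/k)."""
    x = cmath.exp(2j * math.pi * m / k)
    return (W[(label, col)] * x.conjugate()).real

def min_margin(f, k):
    """Smallest gap between the score of the correct label and the best other label."""
    W = build(f, k)
    worst = math.inf
    for col in range(len(f)):
        for m in range(k):
            right = score(W, f[col][m], col, m, k)
            other = max(score(W, y, col, m, k) for y in range(k) if y != f[col][m])
            worst = min(worst, right - other)
    return worst
```

For $k = 3$ the margin is $9 \cdot (1 - \cos(2\pi/3)) = 13.5$, comfortably above 1.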



4 Conclusion and future work 

We have bounded the price of bandit information in the setting of hypothesis class based 
on-line learning and extended the results of Auer et al. (2003). We applied our results to 
estimate the bandit error rate of the class of large margin classifiers. 

The focus of this paper is information theoretic. That is, we have ignored time complexity issues. It is of interest to study the computational price of bandit information, i.e. how the required runtime grows when moving from the full-info to the bandit scenario. It is instructive to consider the PAC setting. Given a learning algorithm, $A$, for a class $H$ in the PAC full-info setting, we can simply construct a bandit learning algorithm as follows: given a sample of unlabeled instances, we guess, for each instance, a label from $Y$, uniformly at random. Typically, we will be correct on about $\frac{1}{|Y|}$ of the examples. Thus, we can generate a labeled i.i.d. sample whose size is a $\frac{1}{|Y|}$-fraction of the original sample, and run the full-info algorithm $A$ on this sample. Using this construction (see Daniely et al. (2011)), it easily follows that in the PAC setting, the price of bandit information, both information theoretic and computational, is $O(k)$. Is this true in the on-line setting as well? We note that this question is open and interesting already for the class of large-margin multiclass linear separators.
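The PAC reduction described above can be sketched in a few lines. The code is an illustration under assumptions of ours: each instance has a single correct label, `is_correct` stands in for the bandit feedback, and the function name `bandit_to_labeled` is hypothetical.

```python
import random

def bandit_to_labeled(instances, is_correct, labels, rng):
    """Guess a uniform label for each instance and keep the confirmed guesses.

    is_correct(x, y) plays the role of the bandit feedback 1[y is a true label of x].
    Returns the list of (x, y) pairs whose guess was confirmed, i.e. a correctly
    labeled sub-sample of expected size len(instances) / len(labels).
    """
    labeled = []
    for x in instances:
        y = rng.choice(labels)
        if is_correct(x, y):
            labeled.append((x, y))
    return labeled
```

The resulting sub-sample can then be passed to any full-info learner, at the cost of a factor of about $|Y|$ in the sample size.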

There is still some room for improvement of the bounds in Theorems 2.2 and 2.3. We conjecture that the optimal bounds are that for every class $H$, $\text{B-Err}^{r}_{H}(T) = O(|Y| \cdot L(H))$ and $\text{B-Err}^{a}_{H}(T) = O\left(\sqrt{|Y| \cdot L(H)\,T}\right)$.

Theorem 2.3 together with Theorem 2.1 characterize the bandit-agnostic error rate up to a factor of $\tilde{O}(\sqrt{k})$. It is of interest to find a tighter characterization. We note that Theorem 2.1 shows that the bandit Littlestone dimension characterizes the error rate in the bandit realizable case for deterministic algorithms. It is an open question to show that this dimension quantifies the error rate also in the agnostic case and for randomized algorithms in the realizable case.

The SOA algorithm

For completeness, we outline the SOA algorithm (of Ben-David et al. (2009) and Daniely et al. (2011)) for a class $H \subseteq Y^X$.

Algorithm 2 Standard Optimal Algorithm (SOA)
1: Initialize: $V_0 = H$.
2: for $t = 1, 2, \ldots$ do
3: receive $x_t$.
4: for $y \in Y$, let $V_t^{(y)} = \{f \in V_{t-1} : f(x_t) = y\}$.
5: predict $\hat{y}_t \in \arg\max_{y} L(V_t^{(y)})$.
6: receive true answer $y_t$.
7: update $V_t = V_t^{(y_t)}$.
8: end for